Preconditioned Spectral Descent for Deep Learning: Supplemental Material
Abstract
Table 1: Parameter Settings for Learning RBMs. RMSprop parameters chosen to match [5]; SGD parameters chosen to match [26]; SSD and A-SSD stepsizes and geometries chosen to match [1]. The stepsize on W is given for the RBM. λ corresponds to the damping factor in the history terms of the ADA and RMS methods. Projections refers to the number of projections used in the randomized SVD algorithm [9] inside the approximate #-operator (Section 2.4).
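The approximate #-operator referenced in the caption is built on a randomized SVD whose cost is governed by the number of projections. The snippet below is a minimal NumPy sketch of that randomized SVD step, in the style of Halko et al. [9]; the function name, the power-iteration count, and the seed handling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def randomized_svd(G, n_projections, n_iter=2, seed=0):
    """Approximate truncated SVD of G via random projections (Halko et al. style)."""
    rng = np.random.default_rng(seed)
    m, n = G.shape
    # Project the columns of G onto a random low-dimensional subspace.
    Omega = rng.standard_normal((n, n_projections))
    Y = G @ Omega
    # A few power iterations sharpen the captured spectrum.
    for _ in range(n_iter):
        Y = G @ (G.T @ Y)
    Q, _ = np.linalg.qr(Y)
    # An exact SVD of the small projected matrix yields the approximate factors.
    B = Q.T @ G
    U_hat, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_hat, s, Vt
```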
Similar Articles
Preconditioned Spectral Descent for Deep Learning
Deep learning presents notorious computational challenges. These challenges include, but are not limited to, the non-convexity of learning objectives and estimating the quantities needed for optimization algorithms, such as gradients. While we do not address the non-convexity, we present an optimization solution that exploits the so far unused “geometry” in the objective function in order to be...
Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks
Effective training of deep neural networks suffers from two main issues. The first is that the parameter spaces of these models exhibit pathological curvature. Recent methods address this problem by using adaptive preconditioning for Stochastic Gradient Descent (SGD). These methods improve convergence by adapting to the local geometry of parameter space. A second issue is overfitting, which is ...
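As a rough illustration of the preconditioned Langevin idea described above, the sketch below combines an RMSprop-style diagonal preconditioner with injected Gaussian noise in a single update; the curvature-correction term of the full algorithm is deliberately omitted here, and all names (`psgld_step`, `grad_log_post`, the hyperparameter defaults) are placeholders rather than the paper's API.

```python
import numpy as np

def psgld_step(theta, grad_log_post, v, eps=1e-3, alpha=0.99, lam=1e-5, rng=None):
    """One simplified preconditioned SGLD step (curvature-correction term omitted)."""
    rng = rng or np.random.default_rng()
    g = grad_log_post(theta)                 # stochastic gradient of the log posterior
    v = alpha * v + (1 - alpha) * g * g      # RMSprop-style second-moment estimate
    precond = 1.0 / (lam + np.sqrt(v))       # diagonal preconditioner
    noise = rng.standard_normal(theta.shape) * np.sqrt(eps * precond)
    theta = theta + 0.5 * eps * precond * g + noise
    return theta, v
```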
Preconditioned Stochastic Gradient Descent
Stochastic gradient descent (SGD) is still the workhorse for many practical problems. However, it converges slowly and can be difficult to tune. It is possible to precondition SGD to accelerate its convergence remarkably. But many attempts in this direction either aim at solving specialized problems or result in significantly more complicated methods than SGD. This paper proposes a new method t...
Stochastic Spectral Descent for Restricted Boltzmann Machines
Restricted Boltzmann Machines (RBMs) are widely used as building blocks for deep learning models. Learning typically proceeds by using stochastic gradient descent, and the gradients are estimated with sampling methods. However, the gradient estimation is a computational bottleneck, so better use of the gradients will speed up the descent algorithm. To this end, we first derive upper bounds on t...
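The spectral-descent update sketched above hinges on the #-operator applied to the gradient of W: writing the gradient's SVD as G = U diag(s) V^T, the step direction becomes (sum of singular values) * U V^T rather than G itself. A minimal dense-SVD sketch follows, with hypothetical function names; in the approximate variant, a randomized SVD (as sketched earlier) would replace the exact factorization.

```python
import numpy as np

def sharp_operator(G):
    """#-operator for the spectral geometry: G = U diag(s) V^T -> ||s||_1 * U V^T."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return s.sum() * (U @ Vt)

def spectral_descent_step(W, grad_W, stepsize):
    """One stochastic spectral descent update on a weight matrix W."""
    return W - stepsize * sharp_operator(grad_W)
```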
Shampoo: Preconditioned Stochastic Tensor Optimization
Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates ...
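For a single matrix-shaped parameter, the Shampoo idea can be sketched as accumulating one preconditioner per tensor dimension and scaling the gradient on both sides by their inverse fourth roots. The sketch below is a simplified single-matrix version under the assumption of dense eigendecompositions; names and defaults are illustrative, not the paper's implementation.

```python
import numpy as np

def matrix_inv_root(M, power=4, eps=1e-6):
    """Compute M^(-1/power) for a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    w = np.maximum(w, eps)
    return (V * w ** (-1.0 / power)) @ V.T

def shampoo_step(W, G, L, R, lr=0.1):
    """Simplified Shampoo update for a 2-D parameter W with gradient G."""
    L = L + G @ G.T                 # accumulate left (row-space) statistics
    R = R + G.T @ G                 # accumulate right (column-space) statistics
    Linv = matrix_inv_root(L)       # L^(-1/4)
    Rinv = matrix_inv_root(R)       # R^(-1/4)
    return W - lr * Linv @ G @ Rinv, L, R
```

In this sketch, L and R would be initialized to small multiples of the identity (e.g. eps * np.eye(m) and eps * np.eye(n)) and carried across steps.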